{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.5 Joint distributions and correlations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are often interested not in the distribution of a single variable but in the relationship between two or more variables. This requires us to understand the concepts of **joint distributions** and **correlation**. \n", "\n", "Returning to the BMI dataset, a high BMI is indicative of being overweight, which is likely to mean that the individual has a high percentage of body fat. Typically, individuals with a high BMI may also be at risk of health conditions such as heart disease, which may be indicated by high blood pressure. \n", "\n", "If we wish to address questions relating to two or more variables, we need to understand their joint distribution.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.5.1 Joint distributions\n", "\n", "If we have two random variables $X$ and $Y$, the joint cumulative distribution function (CDF) is \n", "\n", "$$F(x,y) = P(X \\leq x,Y \\leq y)$$\n", "\n", "regardless of whether $X$ and $Y$ are continuous or discrete. For continuous random variables, the joint density function $f(x,y)$ is non-negative and satisfies \n", "\n", "$$\\int_{-\\infty}^{\\infty} \\int_{-\\infty}^{\\infty} f(x,y)\\: dy\\: dx = 1.$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.5.2 Marginal distributions\n", "\n", "We might sometimes want to think about the marginal distribution of, say, $X$. This means we want to know the distribution of $X$ irrespective of $Y$, and consequently we will need to integrate over all possible values of $Y$. 
The marginal CDF of $X$, or $F_X$, is \n", "\n", "$$F_X (x) = P(X \\leq x)$$\n", "$$ = \\lim_{y \\rightarrow \\infty} F(x,y)$$\n", "$$ = \\int_{-\\infty}^{x} \\int_{-\\infty}^{\\infty} f(u,y)\\: dy\\: du$$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this, it follows that the density function of $X$ alone, known as the **marginal density** of $X$, is\n", "\n", "$$f_X (x) = F_{X}'(x) = \\int_{-\\infty}^{\\infty} f(x,y)\\: dy$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that integrating over $Y$ in this way is not the same as assuming that $X$ is independent of $Y$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So what does this mean in practical terms? Returning to the BMI data, we can report that the average BMI ($\\mu_X$) is 26.46 and the average body fat percentage ($\\mu_Y$) is 35.31. If BMI and body fat were independent variables, knowing BMI would tell us nothing about body fat and *vice versa*. But plotting the data (and some common sense) tells us that this is not the case; if we know one we can say quite a lot about the other. We could explore the correlation between the two variables (more about this later), but we can also describe them together using a joint distribution. By defining them using a joint distribution we are saying nothing about *cause and effect*, just that they are dependent variables. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Cond | Age | Wt | Wt2 | BMI | BMI2 | Fat | Fat2 | WHR | WHR2 | Syst | Syst2 | Diast | Diast2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 43 | 137 | 137.4 | 25.1 | 25.1 | 31.9 | 32.8 | 0.79 | 0.79 | 124 | 118 | 70 | 73 |
0 | 42 | 150 | 147.0 | 29.3 | 28.7 | 35.5 | NA | 0.81 | 0.81 | 119 | 112 | 80 | 68 |
0 | 41 | 124 | 124.8 | 26.9 | 27.0 | 35.1 | NA | 0.84 | 0.84 | 108 | 107 | 59 | 65 |
0 | 40 | 173 | 171.4 | 32.8 | 32.4 | 41.9 | 42.4 | 1.00 | 1.00 | 116 | 126 | 71 | 79 |
0 | 33 | 163 | 160.2 | 37.9 | 37.2 | 41.7 | NA | 0.86 | 0.84 | 113 | 114 | 73 | 78 |
0 | 24 | 90 | 91.8 | 16.5 | 16.8 | NA | NA | 0.73 | 0.73 | NA | NA | 78 | 76 |